1 Introduction

Coordination in a distributed system is facilitated if there is a unique process,
the leader, to manage the other processes. The leader creates edicts and sends
them to other processes for execution or forwarding to other processes. The
leader may fail, and when this occurs a leader election protocol selects a
replacement. That protocol satisfies the following properties:

• Leader Uniqueness: At any time, at most one process is leader.

• Edict Validity: Only leaders can create edicts.

• Edict Ordering: Recipients of multiple edicts can determine the real-time
order in which the edicts were created.

• Leader Stability: If a process is leader then, in the absence of failures and
in the presence of timely communication and processing, it remains leader.

• Eventual Election: If there is no leader then, in the presence of sufficient
timely communication and processing and a bounded number of failures,
a leader is elected.

• Fault Tolerance: Some number of crash failures are tolerated.

• Efficiency: The time and storage, processing, and networking resources
required by the protocol are reasonable.

We assume that processes can exhibit crash failures: a process operates correctly
until a failure causes that process to stop taking execution steps. Crashed
processes are assumed to maintain the values of their variables although these
variables are no longer accessible. The clock at a process is a variable and the
exposition is simplified if that clock continues to advance even after the process
has failed (and the clock is no longer accessible). Message delivery latencies and
processing times are assumed to be unbounded. Message loss and reordering by
the network are allowed, and network partitioning is permitted too.

Nerio is a class of leader election protocols that implement these properties.
Besides developing this class, we derive refinements for two plausible
environments: one assumes bounded drift of clock rate with respect to the rate
of real time; the second assumes bounded differences between clock values on
any two processes at the same time.

Nerio protocols are based on granting leases [6] and require that failure
scenarios are characterized by quorum systems [10], a combination first found
in the leader election protocol of Fetzer and Süßkraut [3]. But leader election
properties (Leader Uniqueness, Leader Stability, and Eventual Election) alone 
offer little value, since a leader may no longer be the leader by the time it sends 
a message let alone when such a message is received by another process. Our 
Nerio protocols, which in addition satisfy Edict Validity and Edict Ordering 
properties, do provide value in asynchronous environments because edicts sent 
by leaders can be interpreted in the order of their creation, even if the processes 
that sent the edicts have ceased being leaders. 

Formalizing the Properties 

Consider a finite set of processes $P = \{p, \ldots\}$. Let $\mathit{isLeader}_p(t)$ be the property
that, at time $t$, process p is leader. Formally, Leader Uniqueness is the following:

Leader Uniqueness:

$$\forall p, q \in P,\ t : (\mathit{isLeader}_p(t) \land \mathit{isLeader}_q(t)) \Rightarrow (p = q) \tag{1}$$

A leader can create edicts that it sends to other processes. For an edict $e$,
define $e.\mathit{creator}$ to be the process that created $e$, and $e.\mathit{created}$ to be the real
time at which $e$ is created. (Note that even the process itself cannot know this
time.) Only leaders can create edicts:

Edict Validity:

$$\forall e : \mathit{isLeader}_{e.\mathit{creator}}(e.\mathit{created}) \tag{2}$$

Edict Ordering means that

• There is a total ordering $\prec$ on edicts.

• For edicts $e$ and $e'$, $e.\mathit{created} < e'.\mathit{created} \Rightarrow e \prec e'$.

• Any recipient of edicts $e$ and $e'$ can ascertain whether $e \prec e'$ or $e' \prec e$
holds.

However, Edict Ordering does not imply that receivers all receive the same
set of edicts. Let $\mathit{Order}_p(e_1, e_2)$ mean that process p received edicts $e_1$ and $e_2$,
and believes that $e_1$ was created before $e_2$. Formally, Edict Ordering is the
following:

Edict Ordering:

$$\forall p, e_1, e_2 : \mathit{Order}_p(e_1, e_2) \Rightarrow e_1.\mathit{created} < e_2.\mathit{created} \tag{3}$$

To formalize Leader Stability and Eventual Election, we assume that there is
a time after which message latencies between correct processes are bounded by
a known constant $d$, and there are no more failures. We call this the Global
Stabilization Time (GST). We do not know when GST occurs, only that it will
happen eventually. Then we can state the following properties:

Leader Stability:

$$\exists \mathit{GST} : \forall t_1, t_2 > \mathit{GST},\ p \in P : (\mathit{isLeader}_p(t_1) \land t_1 < t_2) \Rightarrow \mathit{isLeader}_p(t_2) \tag{4}$$

Eventual Election:

$$\exists \mathit{GST} : \exists t > \mathit{GST},\ p \in P : \mathit{isLeader}_p(t) \tag{5}$$

In Section 2 we describe the Nerio class of leader election protocols, which
leverage the properties of quorum systems instead of requiring accurate failure
detection. Section 3 describes a protocol in this class; it assumes bounded clock
drift. Section 4 describes another protocol that assumes that there is a bound
on how much two clocks may differ. We compare the two protocols in Section 5.
In Section 6 we show how a process can give up its grants to a lease if so
desired. Section 7 shows how Nerio protocols support Edict Validity and Edict
Ordering. We show that the protocols satisfy Leader Stability in Section 8,
while Section 9 demonstrates that the protocols satisfy Eventual Election. A
discussion of various issues follows in Section 10. Section 11 discusses prior
work.

2 A Class of Leader Election Protocols 

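Let $\mathcal{Q}$ be a quorum system on $P$. That is, $\mathcal{Q}$ is a set of process sets such that

$$\forall Q \in \mathcal{Q} : Q \subseteq P \tag{6}$$
$$\forall Q_1, Q_2 \in \mathcal{Q} : Q_1 \cap Q_2 \neq \emptyset \tag{7}$$

An oft-used quorum system consists of all subsets that are majorities in $P$, that
is, $\forall Q \in \mathcal{Q} : |Q| > |P|/2$.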
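
The majority quorum system can be made concrete with a small sketch. The helper names `majority_quorums` and `is_quorum_system` are hypothetical, introduced here only for illustration; the code enumerates every majority subset and checks conditions (6) and (7) by brute force, which is only practical for small process sets.

```python
from itertools import combinations


def majority_quorums(processes):
    """Return every subset of `processes` that holds a strict majority."""
    procs = sorted(processes)
    threshold = len(procs) // 2 + 1  # smallest size with |Q| > |P|/2
    return [
        frozenset(combo)
        for size in range(threshold, len(procs) + 1)
        for combo in combinations(procs, size)
    ]


def is_quorum_system(quorums, processes):
    """Check (6) every quorum is a subset of P and (7) quorums pairwise intersect."""
    universe = set(processes)
    subsets_ok = all(q <= universe for q in quorums)
    intersect_ok = all(q1 & q2 for q1 in quorums for q2 in quorums)
    return subsets_ok and intersect_ok
```

For a five-process set this yields the 16 subsets of size three or more, and any two of them share at least one process.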

Each process p has the following state variables (we use upper case characters
to denote local variables):

$C_p$ (clock): a monotonically increasing clock at process p;

$A_p$ (assignee): a process, initially p itself;

$F_p$ (finish): a clock value measured on the clock of process p, initially 0;

$E_p$ (expiration): another clock value measured on the clock of process p,
initially 0.

If $X_p$ is a local variable at process p, then we write $X_p(t)$ for the value of $X_p$ at
real time $t$.

Assume that $C_p(t)$ is continuous and satisfies the following two conditions,
which should hold for the clocks found on real processes:

Monotonicity:

$$\forall t_1, t_2 : t_1 < t_2 \Rightarrow C_p(t_1) < C_p(t_2) \tag{8}$$

Growth:

$$\forall T : \exists t : C_p(t) > T \tag{9}$$

In practice, the hardware clock increases in a stepwise fashion rather than 
continuously. This is not observable if a clock has a sufficiently high resolution 
relative to the speed at which processes advance. A process can only sample its 
clock, so by obtaining a value T the process only learns that between the time 
that the process requested the sample and the time that it obtained the sample, 
the value of the clock was T. We assume that the clock advances from one 
sample to the next. This can be ensured by making the clock a pair consisting 
of the hardware clock and a counter that is reset each time the hardware clock 
advances and is incremented each time the clock is sampled. This composite 
clock is then ordered lexicographically. Because of the asynchronous nature of
our system, an arbitrary interval may have elapsed between when the
sample is taken and when it is returned to the process. Therefore, a process
cannot tell the difference between a clock that increases continuously and one
that does not.
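
The composite clock described above can be sketched as follows. The class name `CompositeClock` and the `read_hardware` callable are assumptions made for illustration; the sketch also assumes the hardware clock itself never runs backwards. Python tuples compare lexicographically, which matches the ordering described in the text.

```python
class CompositeClock:
    """Pair a coarse hardware clock with a per-tick sample counter."""

    def __init__(self, read_hardware):
        self._read_hardware = read_hardware  # hypothetical hardware clock reader
        self._last_hw = None
        self._counter = 0

    def sample(self):
        hw = self._read_hardware()
        if hw != self._last_hw:
            # The hardware clock advanced: reset the counter.
            self._last_hw = hw
            self._counter = 0
        else:
            # Same hardware tick: the counter keeps samples strictly increasing.
            self._counter += 1
        return (hw, self._counter)
```

Successive samples taken within one hardware tick then differ in the counter component, so every sample is strictly larger than the previous one.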

Assuming that $C_p$ increases, we can define an inverse function $c_p(T)$ on clocks
with the following properties:

$$C_p(c_p(T)) = T \tag{10}$$
$$c_p(C_p(t)) = t \tag{11}$$

Lemma 2.1

$$\forall p \in P,\ t,\ T : C_p(t) < T \Leftrightarrow t < c_p(T)$$

Let $\gamma_{p,q}(t)$ be the predicate

$$\gamma_{p,q}(t) \equiv A_q(t) = p \land C_q(t) < F_q(t).$$

If $\gamma_{p,q}(t)$ holds, we say that, at time $t$, process q grants a lease to process p.
Note that a process cannot grant a lease to two different processes at the same
time $t$, because a variable (e.g., $A_q(t)$) can have only one value at time $t$.
We can now define formally what it means to be leader:

$$\mathit{isLeader}_p(t) \equiv \exists Q \in \mathcal{Q} : (\forall q \in Q : \gamma_{p,q}(t)) \tag{12}$$

That is, process p is leader at time $t$ iff a quorum of processes grant a lease to p
at time $t$. Definition (12) implies (1): because of the intersection property of
quorums (Equation (7)), there cannot be two different quorums, one in which all
processes are granting a lease to $p_1$, and another in which all processes are
granting a lease to a different process $p_2$, at the same time. A process in the
intersection of the two quorums would have to grant a lease to both.
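
Definition (12) can be read as a predicate over a global snapshot of the grantors' state. The sketch below takes that omniscient view purely for illustration (no real process can observe such a snapshot); `GrantState`, `grants_lease`, and `is_leader` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GrantState:
    assignee: str   # A_q(t)
    clock: float    # C_q(t)
    finish: float   # F_q(t)


def grants_lease(p, state):
    """The predicate gamma_{p,q}(t): q's assignee is p and q's grant is unexpired."""
    return state.assignee == p and state.clock < state.finish


def is_leader(p, states, quorums):
    """Equation (12): some quorum consists entirely of processes granting to p."""
    return any(
        all(grants_lease(p, states[q]) for q in quorum)
        for quorum in quorums
    )
```

Here `states` maps each process name to its `GrantState` at the instant of interest, and `quorums` is any collection of sets satisfying (6) and (7).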

We need an implementation of $\mathit{isLeader}_p$. Each process p has a variable
$E_p$, which gives an expiration time of p's leadership, initially 0. Like $F_p$, $E_p$ is
measured on p's clock. In Nerio protocols, the following invariant holds:

$$\forall p, t : (C_p(t) < E_p(t)) \Rightarrow \mathit{isLeader}_p(t) \tag{13}$$

(The implication holds only in one direction because, as we shall see, processes
extend their grants conservatively, and thus it may be that a quorum of
processes are still granting a lease to p after p gives up on the lease.)

Combining (13) and (12) and substituting $\gamma_{p,q}(t)$, we get the following
property:

$$\forall p, t : (C_p(t) < E_p(t)) \Rightarrow (\exists Q \in \mathcal{Q} : (\forall q \in Q : A_q(t) = p \land C_q(t) < F_q(t))) \tag{14}$$

We take this as the defining characteristic of a Nerio class leader election
protocol.

At this point it is useful to consider what happens if a process crashes.
By the Growth condition (Equation (9)), the clock of the process continues
increasing. We need this in order to ensure that if a crashed process p was a
leader, eventually it stops being leader (because $C_p(t) < E_p(t)$ becomes false),
and if a crashed process q granted a lease, eventually this lease expires (because
$C_q(t) < F_q(t)$ becomes false). Since a crashed process cannot produce any
output, having the clock stop is indistinguishable from a clock that continues
to increase.¹

Below we will show examples of protocols that maintain (14), given certain
assumptions about the environment.



3 Clocks with Bounded Drift 

Assume the drift (accuracy of rate) of each clock is bounded by a constant $\rho$
per time unit. That is:

$$\forall p \in P,\ t,\ \delta : C_p(t) + (1 - \rho)\delta < C_p(t + \delta) < C_p(t) + (1 + \rho)\delta. \tag{15}$$

In other words, during a real-time period $\delta$, the clock of a process may advance
by as little as $(1 - \rho)\delta$, or as much as $(1 + \rho)\delta$.
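
As a worked instance of the drift bound, with values that are assumptions chosen only for illustration (a drift bound at the upper end of the range quoted in Section 5, and a ten-second period):

```latex
\rho = 10^{-5},\quad \delta = 10\,\mathrm{s}
\;\Longrightarrow\;
(1-\rho)\,\delta = 9.9999\,\mathrm{s},
\qquad
(1+\rho)\,\delta = 10.0001\,\mathrm{s}
```

Over ten seconds of real time such a clock would advance by between 9.9999 s and 10.0001 s, that is, it would be off by at most 100 microseconds.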

In the Nerio class protocol that we derive in this section, a process p never
decreases $F_p$. Thus the protocol maintains the following invariant:

$$\forall p \in P,\ t_1, t_2 : t_1 < t_2 \Rightarrow F_p(t_1) \le F_p(t_2). \tag{16}$$

¹ In practice, a hardware clock often continues to increase for some amount of
time as it is backed up by an internal battery.

Furthermore, consistent with the meaning of a lease, a process p never changes
$A_p$ if $C_p < F_p$. As a result, once a process p has granted a lease to $A_p$, this
grant remains until real time $c_p(F_p)$.

A process p trying to become leader (or extend the period during which it
is leader) executes the following algorithm, which we call Obtain Quorum with
Bounded Drift, or OQwBD for short. Process p uses a temporary variable
$\mathit{Start}_p$ into which it stores the starting time of the algorithm:

1. set $\mathit{Start}_p := C_p$ (sample starting time);

2. select a real time period $\delta$, $\delta > 0$;

3. broadcast (grantRequest, p, $\mathit{Start}_p$, $\delta$).

Upon receipt of a grantRequest message, a process q does the following:

4. $T_q := C_q$ (save local time into a temporary variable $T_q$);

5. if $p \neq A_q \land T_q < F_q$, then ignore the request (q is already granting a lease
to $A_q$, $A_q \neq p$);

6. otherwise

6.1. $A_q := p$; $F_q := \max(F_q, T_q + (1 + \rho) \cdot \delta)$;

6.2. send (ok, q, $\mathit{Start}_p$) to p.

Meanwhile, process p waits for ok messages:

7. wait for an (ok, q, $\mathit{Start}_p$) from each process q in a quorum of $\mathcal{Q}$ or until
$C_p > \mathit{Start}_p + (1 - \rho) \cdot \delta$;

8. if ok messages are received from a quorum and $C_p < \mathit{Start}_p + (1 - \rho) \cdot \delta$,
then $E_p := \mathit{Start}_p + (1 - \rho) \cdot \delta$ (we say that the OQwBD algorithm completed);

9. if an insufficient number of ok responses have been received by the time
$C_p > \mathit{Start}_p + (1 - \rho) \cdot \delta$, then this instantiation of OQwBD failed.

By measuring $(1 - \rho) \cdot \delta$ on its local clock, process p stops believing it is
leader no later than $\delta$ real time units after p initiated OQwBD.
A process q, by measuring $(1 + \rho) \cdot \delta$, grants the lease for at least $\delta$ real time
units from the moment process p started OQwBD. (Process q calculates a maximum
in order to ensure that $F_q$ can only progress forwards, as required by (16).)
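
The two sides of OQwBD can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: networking, timers, and aborts are omitted, clock samples are passed in as plain numbers, and the names `Grantor` and `try_complete` are hypothetical.

```python
class Grantor:
    """Grantor side of OQwBD (Steps 4 to 6) for one process q."""

    def __init__(self, name, rho):
        self.name = name
        self.rho = rho
        self.assignee = name  # A_q, initially q itself
        self.finish = 0.0     # F_q, initially 0

    def on_grant_request(self, p, start_p, delta, now):
        """`now` plays the role of T_q. Returns an ok tuple, or None if ignored."""
        if p != self.assignee and now < self.finish:
            return None  # Step 5: an unexpired lease is held by someone else
        self.assignee = p
        # Step 6.1: the max keeps F_q from ever moving backwards.
        self.finish = max(self.finish, now + (1 + self.rho) * delta)
        return ("ok", self.name, start_p)


def try_complete(start_p, delta, rho, oks, quorums, now):
    """Requester side (Steps 7 and 8): return the new E_p, or None on failure."""
    deadline = start_p + (1 - rho) * delta
    # Only responses tagged with this instantiation's Start_p count.
    responders = {ok[1] for ok in oks if ok is not None and ok[2] == start_p}
    if now < deadline and any(quorum <= responders for quorum in quorums):
        return deadline
    return None
```

Note the asymmetry that the text describes: the grantor stretches the period by $(1 + \rho)$ while the requester shrinks it by $(1 - \rho)$, so the grant outlives the requester's belief that it is leader.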

Process p can run OQwBD at any time. We say that p aborts OQwBD if p starts a
new execution before the current one completed. Once aborted, responses for the
earlier instantiation of OQwBD will be ignored. The clock value $\mathit{Start}_p$ is included
in the messages only to identify an instantiation of OQwBD (the tuple (p, $\mathit{Start}_p$)
uniquely identifies an instantiation of OQwBD); receivers do not interpret the clock
value, but return it in the response. This way, process p can ignore responses
of aborted instantiations.

Figure 1: Example of a protocol exchange between a process p that initiated the
protocol and a process q that granted the lease and is included in the quorum
that p uses to complete the protocol. For clarity, the drift $\rho = 0$. Dashed
horizontal lines indicate real time. The labels to the left are clock values on p's
clock; the labels on the right are clock values on q's clock.

To prove that (14) holds for any p and $t$, consider a process p and a time $t$
for which $C_p(t) < E_p(t)$ holds, and note that $t < c_p(E_p(t))$ (Lemma 2.1). Let
$t'$ be the time at which the last instantiation of OQwBD completed at process p.
Note that p updates $E_p$ only at this time, and since this is the last instantiation
of the protocol, we have the following:

$$E_p(t) = E_p(t') \tag{17}$$

Let $t_p$ be the time at which p assigned $\mathit{Start}_p$ in this last instantiation of
OQwBD, and thus

$$\mathit{Start}_p(t') = \mathit{Start}_p(t_p) \tag{18}$$

We have to show that there exists a quorum $Q$ such that
$\forall q \in Q : A_q(t) = p \land C_q(t) < F_q(t)$. We show that the quorum that responded
to p and caused p to complete OQwBD is such a quorum.

Let $Q$ be the quorum that responded to p. Consider a process $q \in Q$ and
let $t_q$ be the time at which q sampled its local clock, resulting in
a value $T_q$, so that $T_q = C_q(t_q)$. Note that

$$t_p < t_q < t' < t < c_p(E_p(t)). \tag{19}$$

(See Figure 1 for an illustration in the case $\rho = 0$.)



Lemma 3.1 $c_p(E_p(t)) < c_q(F_q(t))$.



Proof

(1) $E_p(t) = E_p(t')$   (Equation (17))

(2) $E_p(t') = \mathit{Start}_p(t') + (1 - \rho)\delta$   (Algorithm OQwBD)

(3) $\mathit{Start}_p(t') = \mathit{Start}_p(t_p)$   (Equation (18))

(4) $\mathit{Start}_p(t_p) = C_p(t_p)$   (Algorithm OQwBD)

(5) $C_p(t_p) + (1 - \rho)\delta < C_p(t_p + \delta)$   (Equation (15))

(6) $E_p(t) < C_p(t_p + \delta)$   (Combining (1) thru (5))

(7) $c_p(E_p(t)) < t_p + \delta$   (Lemma 2.1)

(8) $F_q(t) \ge F_q(t_q)$   (Equations (19) and (16))

(9) $F_q(t_q) \ge C_q(t_q) + (1 + \rho)\delta$   (Algorithm OQwBD)

(10) $C_q(t_q) + (1 + \rho)\delta > C_q(t_q + \delta)$   (Equation (15))

(11) $F_q(t) > C_q(t_q + \delta)$   (Combining (8), (9), (10))

(12) $t_q + \delta < c_q(F_q(t))$   (Lemma 2.1)

(13) $(t_p + \delta) < (t_q + \delta)$   (Equation (19))

(14) $c_p(E_p(t)) < (t_p + \delta) < (t_q + \delta) < c_q(F_q(t))$   (Combining (7), (12), (13))

∎

It remains to show that $A_q(t) = p \land C_q(t) < F_q(t)$. This follows directly
from $t_q < t' < t < c_p(E_p(t)) < c_q(F_q(t))$. After assigning $A_q$ and $F_q$ between $t_q$
and $t'$, $A_q$ cannot be changed until $c_q(F_q(t))$ at the earliest.

Note that no effort is made to detect process crashes. If $\mathit{isLeader}_p(t)$ holds at the
time a process p crashes, that process continues to be leader until there is no
longer a quorum of processes that grant a lease to p.



4 Clocks with Bounded Skew 

Our second instance of a Nerio class leader election protocol is similar to the
first, but instead of assuming bounded drift (Equation (15)) we assume that
clocks at any two processes always differ by at most $\Delta$:

$$\forall p, q, t : -\Delta < C_p(t) - C_q(t) < \Delta. \tag{20}$$

We present a new algorithm called Obtain Quorum with Bounded Skew, or
OQwBS for short. (Skew is the difference between two clock values at the same
time.) The variables of OQwBS are the same as those of OQwBD. A process p can
initiate OQwBS as follows:

1. set $\mathit{Start}_p := C_p$ (sample starting time);

2. select a time period $\delta$, $\delta > 0$;

3. broadcast (grantRequest, p, $\mathit{Start}_p$, $\delta$).

Upon receipt, a process q does the following:

4. $T_q := C_q$ (sample local time);

5. if $p \neq A_q \land T_q < F_q$, then ignore the request;

6. otherwise

6.1. $A_q := p$; $F_q := \max(F_q, \mathit{Start}_p + \delta + \Delta)$;

6.2. send (ok, q, $\mathit{Start}_p$) to p.

Note that because any two clocks differ by at most $\Delta$, q can interpret $\mathit{Start}_p$
with respect to its own clock. As before, process p waits for ok responses:

7. wait for an (ok, q, $\mathit{Start}_p$) from each process q in a quorum of $\mathcal{Q}$ or until
$C_p > \mathit{Start}_p + \delta$;

8. if ok messages are received from a quorum and $C_p < \mathit{Start}_p + \delta$, then
$E_p := \mathit{Start}_p + \delta$ (we say that the OQwBS algorithm completed);

9. if an insufficient number of ok responses have been received by the time
$C_p > \mathit{Start}_p + \delta$, then this instantiation of OQwBS failed.

Again, the proof of Leader Uniqueness is based on showing that
$c_p(E_p(t)) < c_q(F_q(t))$, and it is easy to see why this is true.
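
One way to fill in that argument is sketched below. It is a reconstruction from Steps 6.1 and 8 and Equation (20), not necessarily the authors' own proof; it introduces the shorthand $\tau$ for the real time at which p's leadership expires, considers a process q in the quorum that responded to p, and assumes that $F_q$ is never decreased.

```latex
\begin{align*}
\tau &= c_p(E_p(t))
  && \text{so } C_p(\tau) = E_p(t) = \mathit{Start}_p + \delta
     \text{ (Step 8, Equation (10))} \\
C_q(\tau) &< C_p(\tau) + \Delta
  && \text{(Equation (20))} \\
 &= \mathit{Start}_p + \delta + \Delta \\
 &\le F_q(t)
  && \text{(Step 6.1, with } F_q \text{ never decreased)} \\
\tau &< c_q(F_q(t))
  && \text{(Lemma 2.1)}
\end{align*}
```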

5 Comparison



In the OQwBD protocol of Section 3, a process q may grant a lease to a process p
long beyond $c_p(E_p(t))$ if the grantRequest message from p is delayed that much.
If p has failed, this grant could prevent other processes from becoming leader.
It thus appears that the OQwBS protocol of Section 4, which is based on bounded
skew, has an important advantage. However, below we will argue that in practice
bounded drift is more likely to be guaranteed than bounded skew, so OQwBD is
likely to be more robust in practice.

Hardware clock manufacturers often specify a bound on clock drift, and
this bound is typically within the range of $10^{-7}$ to $10^{-5}$ given a sufficiently
stable temperature within the casing of a computer chassis. For performance
measurements, in which it is necessary to measure the passage of time, rather 
than to tell what time it is, operating systems usually provide access to the raw 
clock value, as opposed to one that may be adjusted by a clock synchronization 
protocol attempting to reduce skew. 

Under virtualization, the hardware clock may be virtualized, and drift would 
no longer be bounded. Fortunately, Xen allows guests to sample the hardware 
clock. Under VMware, the hardware clock is not directly accessible. Fortu- 
nately, VMware does make CPU performance counters accessible, including a 
way to measure the passage of time. If this facility documents a bound on drift, 
then this is enough for our purposes. However, if a virtual machine is migrated, 
a clock may jump arbitrarily, violating the assumptions that we make on clocks. 

The protocol based on bounded skew allows processes to leverage a bound $\Delta$
to avoid a process granting a lease more than $\Delta$ beyond $c_p(E_p(t))$. Bounded skew
requires a clock synchronization algorithm. Clock synchronization algorithms 
require bounded latency on communication and bounded execution times, in 
addition to requiring bounded clock drift. In the absence of such bounds, clock 
synchronization algorithms such as NTP provide, at best, probabilistic bounds 
on skew (with unspecified probability). 

Below we will only assume bounded clock drift and not bounded skew, al- 
though the results are generalized easily. 

6 Releasing Grants 

It is sometimes useful for a process to give up the grants it received. For example,
if a process is not able to obtain grants from a quorum, and thus does not have
a lease on leadership, then it might as well give up the grants that it has so that
perhaps another process can have better luck. Even if a process did obtain a lease
and became leader, it may for some reason give up its leadership by releasing
its grants. In this section we show how this can be done without violating
invariant (14).

A process p that wants to release its grants first aborts any instance of OQwBD
that it may be running. Second, process p sets $E_p$ to $C_p$. We note that
it is always safe for a process p to do so, as this cannot affect the validity of
invariant (14). If p was leader, it will no longer be leader as a result. So at this
point, p is neither leader nor is it trying to become one.

Next, p broadcasts a request to all peers to release its grant. A process q that
receives such a message from p checks whether $p = A_q$. If not, it ignores the
request. If so, it sets $F_q$ to $C_q$, causing the grant to expire immediately, even
if $F_q > C_q$. The reader will notice that this violates invariant (16), which was
used in Lemma 3.1 to prove that $c_p(E_p(t)) < c_q(F_q(t))$. However, this lemma
was only shown to hold when $C_p(t) < E_p(t)$, and because p has reset $E_p$ to $C_p$,
this precondition no longer holds. Invariant (16) is not used elsewhere, and thus
expiring the grant does not violate invariant (14).

7 Edicts 

Leaders create edicts, which they send to one or more processes. Because of a 
lack of assumptions about message latencies and process execution speeds, such 
an edict may take an arbitrary amount of time to arrive at a process, and may 
even be lost in the network. Edicts may also be stored or forwarded. So an old 
edict may be delivered after an edict that was created more recently, possibly 
by a different leader. Edict Ordering prevents chaos: it ensures that any two 
different edicts can be compared and ordered in a manner consistent with the 
real times of their creation. 

For this ordering to make sense, the time an edict is created must be defined
properly. The last step by a process p creating an edict $e$ is to sample $C_p$,
obtaining a value $T = C_p(t)$. In order to ensure Edict Validity, the process
determines if $T < E_p$. If so, then $t$ is the creation time of edict $e$, that is,
$e.\mathit{created} = t$. (Unfortunately, even the leader itself cannot determine $t$.) If not,
then the edict creation fails, because at time $t$, process p may not have been
leader.

We describe how $\mathit{Order}_p(e_1, e_2)$ in Equation (3) is implemented for algorithm
OQwBD, but the idea does not depend on specifics of OQwBD. Extend the ok
response (Step 6.2) from q with $T_q$, such that q sends (ok, q, $T_q$, $\mathit{Start}_p$) to p.
Process p awaits messages from all processes in a quorum $Q \in \mathcal{Q}$, and constructs
a Quorum Timestamp $QT_p$ as the set of pairs $(q, T_q)$ for all $q \in Q$. In addition,
process p maintains an Edict Counter $EC_p$, initially 0, counting the number of
edicts created by p.

Every time p creates an edict, it tags that edict with an Edict Timestamp
$(QT_p, EC_p)$ and increments $EC_p$. Before sending the edict, process p checks
whether $C_p < E_p$ to make sure it is still leader. If not, the edict is not valid and
should be discarded.
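
The edict creation steps can be sketched as follows. The class `EdictCreator`, its dictionary-based edict representation, and the `sample_clock` callable are assumptions made for illustration; the quorum timestamp is represented as a mapping from each quorum member q to its sample $T_q$.

```python
class EdictCreator:
    """Tags edicts with (QT_p, EC_p) and validates them against E_p."""

    def __init__(self, name, quorum_timestamp, expiration, sample_clock):
        self.name = name
        self.quorum_timestamp = quorum_timestamp  # QT_p: {q: T_q}
        self.expiration = expiration              # E_p, on p's own clock
        self.sample_clock = sample_clock          # hypothetical reader for C_p
        self.edict_counter = 0                    # EC_p, initially 0

    def create_edict(self, payload):
        """Return a tagged edict, or None if leadership can no longer be assured."""
        timestamp = (dict(self.quorum_timestamp), self.edict_counter)
        self.edict_counter += 1
        # Sampling the clock is the last step: it defines the creation time.
        if not self.sample_clock() < self.expiration:
            return None
        return {"creator": self.name, "timestamp": timestamp, "payload": payload}
```

A caller would treat a `None` result as a failed creation and would have to complete a new instantiation of the quorum algorithm before trying again.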

We define an ordering on Edict Timestamps and show it consistent with the
real time in which the edicts were created. Edict timestamps are
lexicographically ordered, first by quorum timestamp and then by the natural
ordering on edict counters. Quorum timestamps are ordered as follows:

$$QT_1 < QT_2 \Leftrightarrow (\exists q, T_1, T_2 : (q, T_1) \in QT_1 \land (q, T_2) \in QT_2 \land T_1 < T_2) \tag{21}$$
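
The ordering can be sketched as a pair of comparison functions. The names `qt_less` and `edict_less` are hypothetical, and quorum timestamps are again represented as mappings from q to $T_q$. The sketch assumes the timestamps come from completed instantiations, so that at most one of the two directions of (21) holds for any pair.

```python
def qt_less(qt1, qt2):
    """Equation (21): some process q appears in both with T1 < T2."""
    return any(q in qt2 and t1 < qt2[q] for q, t1 in qt1.items())


def edict_less(ts1, ts2):
    """Lexicographic order on edict timestamps (QT, EC)."""
    (qt1, ec1), (qt2, ec2) = ts1, ts2
    if qt_less(qt1, qt2):
        return True
    if qt_less(qt2, qt1):
        return False
    # Neither quorum timestamp precedes the other: same instantiation,
    # so fall back to the edict counter.
    return ec1 < ec2
```

Because any two quorums intersect, two timestamps from different instantiations always share at least one process q whose samples can be compared.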






We show that this ordering is consistent with the creation times of edicts.
Let $X$ be a completed instantiation of OQwBD. $X$ has the following attributes:

$X.\mathit{owner}$: the process that initiated $X$ and became leader

$X.\mathit{start}$: the real time when $X$ started (i.e., $c_{X.\mathit{owner}}(\mathit{Start}_{X.\mathit{owner}})$)

$X.\mathit{completion}$: the real time when $X$ completed

$X.QT$: the quorum timestamp that $X.\mathit{owner}$ generated

$X.\mathit{expiration}$: the real time when $X$ expires (i.e., $c_{X.\mathit{owner}}(E_{X.\mathit{owner}})$)

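Some trivial observations about such an $X$ are:

$$X.\mathit{start} < X.\mathit{completion} < X.\mathit{expiration} \tag{22}$$

$$\forall t : (X.\mathit{completion} < t < X.\mathit{expiration}) \Rightarrow \mathit{isLeader}_{X.\mathit{owner}}(t) \tag{23}$$

$$\forall q, T : (q, T) \in X.QT \Rightarrow X.\mathit{start} < c_q(T) < X.\mathit{completion} \tag{24}$$

We order instantiations by their completion time, that is,
$X < X' \Leftrightarrow X.\mathit{completion} < X'.\mathit{completion}$.

Lemma 7.1 $\forall X, X' : X < X' \Rightarrow X.QT < X'.QT$.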

Proof By contradiction, assume there exist an $X$ and $X'$ such that $X < X'$
(and thus $X.\mathit{completion} < X'.\mathit{completion}$) and $\lnot(X.QT < X'.QT)$. Because
quorums overlap, there must exist $q$, $T$, $T'$ such that $(q, T) \in X.QT$ and
$(q, T') \in X'.QT$. By assumption, $T \ge T'$ (for otherwise $X.QT < X'.QT$). We
consider two cases.

• $X.\mathit{owner} = X'.\mathit{owner}$: Then it must be the case that $X.\mathit{completion} <
X'.\mathit{start}$ (or $X$ would have been aborted and could not have completed).
From (24) it must be that $T < T'$, contradicting the assumption that
$T \ge T'$.

• $X.\mathit{owner} \neq X'.\mathit{owner}$: From time $c_q(T)$ until $X.\mathit{completion}$ (and
beyond), q has granted a lease to $X.\mathit{owner}$, and similarly, from $c_q(T')$ to
$X'.\mathit{completion}$, q has granted a lease to $X'.\mathit{owner}$. Because $X.\mathit{completion} <
X'.\mathit{completion}$ and $c_q(T) \ge c_q(T')$, it must be the case that at time
$X.\mathit{completion}$, process q has granted a lease both to $X.\mathit{owner}$ and $X'.\mathit{owner}$.
But a process cannot grant leases to two different processes at the same
time. ∎

Note, as a corollary, that quorum timestamps are well-ordered, consistent
with the ordering on instantiations of OQwBD.



8 Leader Stability 

When there is a leader, Leader Stability implies that the leader persists in that
role in the absence of failures and while messages are delivered and processed in
a timely fashion. Suppose that message round-trip time is bounded by a known
constant $d$. In that case, if leader p starts OQwBD before $c_p(E_p) - d$, then it is
able to extend its leadership before it expires. So p should use $\delta > 2d$ in order
that in the next period of leadership it is able to do so again. Choices of $\delta$ are
discussed in Section 10.1.

9 Eventual Election 

If there is no leader (i.e., $\forall p : C_p \ge E_p$) then multiple processes could try to
become leader. There is no guarantee that any will succeed, however. But
if we could somehow ensure that only one process p executes OQwBD, and the
process waits long enough to do so (so that all outstanding grants $F_q$ have
expired), then OQwBD is guaranteed to succeed eventually (after GST). This would
seem to create a circularity, as choosing p requires solving leader election. The
way out is to use a weak version of leader election (which may select multiple
weak leaders) in order to make successful completion of OQwBD likely. The more
likely it is that weak leader election selects only a single weak leader, the more
likely an instantiation of OQwBD terminates successfully. In addition, for Eventual
Election to hold, after GST the weak leader election protocol is required to
produce a single weak leader.

Here is such a weak leader election algorithm: Assume processes in $P$ are
ordered, that is, $p < q < \ldots$, and elect the smallest process in $P$ that has not
failed. To this end, processes are organized into a virtual ring in order. The
scheme uses a failure detection algorithm such as simple pinging or the more
sophisticated φ-accrual failure detector [7] that gives a better approximation of
the failure status of processes. Each process pairs with the closest predecessor
and closest successor on the ring that it considers correct according to the
failure detector, and monitors them. If a process q believes it is the lowest
correct process (because the identifier of its predecessor is larger than its own),
then it considers itself a weak leader. Note that under the properties of GST,
failure detection becomes accurate and the algorithm will produce a single leader.
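
The local decision each process makes can be sketched as below. The function name `considers_itself_weak_leader` is hypothetical, and the set of suspected processes stands in for whatever the failure detector currently reports; the sketch covers only the decision rule, not the monitoring itself.

```python
def considers_itself_weak_leader(me, ring, suspected):
    """True if `me` sees no trusted process with a smaller identifier."""
    # A process always trusts itself; everything else is filtered by suspicion.
    trusted = sorted(p for p in ring if p == me or p not in suspected)
    idx = trusted.index(me)
    # Closest trusted predecessor on the ring; index -1 wraps around to the
    # largest identifier when `me` is the smallest trusted process.
    predecessor = trusted[idx - 1]
    return predecessor >= me
```

With inaccurate suspicions two processes can both return `True` (for example, process 2 wrongly suspecting process 1 while process 1 is still running), which is why this only yields weak leaders before GST.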

A weak leader q initiates OQwBD to try to become a leader unless
$A_q \neq q \land C_q < F_q$ (i.e., it initiates only if it is not currently granting a
lease to another process). It does so periodically in order to deal with possible
collisions and message loss.

The Eventual Election property states that under the conditions that hold 
after GST, a leader will eventually be chosen by the Nerio protocol if there is none 
yet (and, because of Leader Stability, it will remain leader henceforth). To see 
why Eventual Election holds for the presented protocols, note that eventually 
only one process will attempt to become leader because of the properties of weak 
leader election. After all conflicting grants have expired, and because round-trip 
latencies are bounded by d, eventually this process will complete OQwBD. 

10 Discussion



10.1 Choice of $\delta$

A process that initiates OQwBD chooses some value for $\delta$. No matter what value
of $\delta$ is chosen, Leader Uniqueness will hold, but choosing $\delta$ too small could
adversely affect Leader Stability and Eventual Election. Therefore $\delta$ should be
chosen large enough so a leader can extend the period of its leadership without
interruption, but short enough so that recovery can be swift after the leader
fails.

Suppose $d$ is an estimate for the round-trip time, and represents that, say,
in 99.9% of round-trips the round-trip latency is less than $d$. A leader might
initiate OQwBD to extend its lease before $E_p - (1 + \rho) \cdot d$ measured on its local
clock. Clearly, $\delta$ should be chosen larger than $d$ plus the time that remains on
the lease, which can be conservatively estimated by p as $(1 + \rho) \cdot (E_p - C_p)$.

In practice, $d$ is likely no larger than a few milliseconds on today's hardware,
assuming processing of Nerio messages receives a high priority, and $\rho$ is likely no
more than 10 microseconds per second. But choosing $\delta$ as small as possible would
likely result in too many round-trips per second. If we want to space out
instantiations of OQwBD by at least $i$ time units, then we should choose
$\delta = d + \max(i, (1 + \rho) \cdot (E_p - C_p))$.
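
The rule for choosing $\delta$ translates directly into a one-line function. The name `choose_delta` and its parameter names are hypothetical; all quantities are assumed to be in the same time unit.

```python
def choose_delta(d, rho, expiration, clock, spacing):
    """delta = d + max(i, (1 + rho) * (E_p - C_p)), with `spacing` playing the role of i."""
    remaining = (1 + rho) * (expiration - clock)  # conservative time left on the lease
    return d + max(spacing, remaining)
```

For example, with an assumed round-trip estimate of 5 ms, a drift bound of $10^{-5}$, one second left on the lease, and a desired spacing of 0.5 s, the remaining lease time dominates and $\delta$ comes out slightly above 1.005 s.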

10.2 Interference 

If more than one process concurrently tries to become leader, then none may be 
able to enlist a quorum. They each would then have to wait to let conflicting 
grants expire before attempting to rerun OQwBD. 

In both proposed Nerio protocols, processes respond to the initiator only if 
they grant the lease request. But there is something to gain if the protocols 
are modified so that if a process has granted a lease to another process, then 
instead of just ignoring the grant request, it responds with an error message. 
The error message helps an initiator to determine if there is hope of obtaining 
a quorum. 

The protocols can be extended with revocation requests to further avoid
interference. This further extension requires that grant and revocation requests
from the same source are delivered in FIFO order. When a process q receives
a revocation request from a process p, and if $A_q = p \land C_q < F_q$, then q sets $F_q$
to $C_q$, thereby releasing its grant to p. (The FIFO order ensures that delayed
revocation requests do not inadvertently revoke outstanding grants.) Obviously,
a process p that sends a revocation request must first have aborted the protocol,
thus even if it ends up collecting positive responses from a quorum, $E_p$ should
not be advanced.
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10.3 Network Partitioning 



Nerio class protocols work even if the network partitions, provided there is a
partition that contains a quorum of correct processes. If there is no such
partition, or if functionality is desired in minority partitions (i.e., partitions that
do not hold a quorum of processes), then the weak leader election algorithm
might be used to assign a temporary, non-authoritative leader in each partition
that can provide partial functionality.

10.4 Finding the Leader 

What if an external client, seeking that an edict be issued, sends a request to a 
process p in P but p is not currently leader? If process p has an unexpired grant 
for another process q, then process p can respond by giving q as a forwarding 
address. If not, process p may attempt to become leader. Failing that, p may 
buffer the request until a leader emerges, or return an error response. 
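
The decision just described might look as follows. This is only a sketch: the function name, the returned tags, and the `try_become_leader` callback are hypothetical, and `A_p` and `F_p` denote the holder and expiry time of the grant that p has given out.

```python
def handle_client_request(self_id, A_p, F_p, C_p, try_become_leader):
    """Decide what a non-leader process p does with a client request.

    A_p -- process holding p's grant, or None if there is none
    F_p -- expiry of that grant on p's clock
    C_p -- current reading of p's clock
    try_become_leader -- hypothetical callable; True if p became leader
    """
    if A_p is not None and A_p != self_id and C_p < F_p:
        return ("forward", A_p)        # unexpired grant: q is the likely leader
    if try_become_leader():
        return ("issue", self_id)      # p is now leader and can issue the edict
    return ("buffer_or_error", None)   # wait for a leader or report failure


print(handle_client_request("p", "q", F_p=10.0, C_p=5.0,
                            try_become_leader=lambda: False))
```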

10.5 Leader Verification 

A process q may want to check whether some other process p is leader. The 
following protocol, based on bounded drift, will accomplish this: 

1. set $Start_q := C_q$ (save starting time);

2. send $(\mathit{verifyLeadership}, q, Start_q)$ to $p$.

Upon receipt of a verifyLeadership message, a process $p$ does the following:

3. calculate $\delta := (E_p - C_p)/(\rho + 1)$;

4. send $(\mathit{remainder}, p, \delta, Start_q)$ to $q$.

Here $\delta$ equals the minimal amount of real time that is left of $p$'s leadership.
Note that if $p$ is no longer leader, $\delta$ will be negative. If $q$ receives the response,
it calculates $T = Start_q + \delta \cdot (1 - \rho)$, and as long as $C_q < T$ holds, $p$ is guaranteed
to be leader (and possibly a bit longer than that, depending on rate differences
between $C_p$ and $C_q$).
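
The arithmetic of this exchange can be sketched as below. Message passing is omitted, the function names are hypothetical, and the sketch assumes that q's deadline is $Start_q + \delta \cdot (1 - \rho)$ on its own clock and that q trusts p's leadership only while its clock reads less than that deadline.

```python
def remaining_leadership(E_p, C_p, rho):
    """Step 3 at p: lower bound on the real time left of p's leadership."""
    return (E_p - C_p) / (rho + 1)


def leadership_deadline(start_q, delta, rho):
    """At q: deadline on q's clock, counted from when q sent the request."""
    return start_q + delta * (1 - rho)


def still_known_leader(C_q, T):
    """At q: True while p is guaranteed to be leader."""
    return C_q < T


# Hypothetical values; the large drift bound just keeps the numbers readable.
delta = remaining_leadership(E_p=10.0, C_p=4.0, rho=0.5)
T = leadership_deadline(start_q=100.0, delta=delta, rho=0.5)
print(delta, T, still_known_leader(101.0, T))
```

Counting from $Start_q$ rather than from the arrival of the response is conservative: it charges the whole round trip against the remaining lease.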

10.6 Changing Membership 

Nerio protocols can be adapted to handle the case where P changes over time. 
We introduce epochs, numbered consecutively starting at 0. Each epoch $e$ is
associated with a set of processes $P_e$ and a quorum system $Q_e$ defined on $P_e$. For
simplicity, assume different epochs have non-overlapping sets of processes:

$\forall e, e' : e \neq e' \Rightarrow P_e \cap P_{e'} = \emptyset$. (25)

(In practice, a process that is a member of more than one epoch should maintain 
different copies of its state variables for each epoch.) 

Each epoch is defined to be PENDING, RUNNING, or TERMINATED. Each epoch
starts in the PENDING state, except for epoch 0, which starts in the RUNNING
state. At any point in time, at most one epoch is in the RUNNING state, and all
prior epochs are TERMINATED.

If epoch $e$ is RUNNING, it can be terminated by getting each process $q$ in some
quorum of $Q_e$ to set $A_q := \bot$ and $F_q := \infty$. (A process can only do so if it has no
current outstanding grant.) We say that process $q$ is wedged if $A_q = \bot \land F_q = \infty$
holds. Once a quorum of processes is wedged, no process can become leader
in that epoch. At that same time, epoch $e + 1$ automatically becomes RUNNING.
That is, an epoch is defined to be RUNNING iff all prior epochs are TERMINATED
and no quorum of processes are all wedged in that epoch.
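
The definition of the epoch states can be made concrete with a small sketch. It takes the viewpoint of an omniscient observer who sees every process's variables at once, which no real process can do; the data layout and function names are assumptions for illustration, with `None` standing in for $\bot$.

```python
import math


def is_wedged(A_q, F_q):
    """A process is wedged if it holds no grant and its expiry is infinite."""
    return A_q is None and F_q == math.inf


def is_terminated(states, quorums):
    """An epoch is TERMINATED once every process of some quorum is wedged.

    states  -- dict mapping process name to (A_q, F_q)
    quorums -- iterable of sets of process names
    """
    return any(all(is_wedged(*states[q]) for q in Q) for Q in quorums)


def running_epoch(epochs):
    """Return the number of the RUNNING epoch: the first non-terminated one.

    epochs -- list of (states, quorums) pairs, indexed by epoch number
    """
    for e, (states, quorums) in enumerate(epochs):
        if not is_terminated(states, quorums):
            return e
    return None  # every known epoch has terminated


epoch0 = ({"a": (None, math.inf), "b": (None, math.inf), "c": ("c", 5.0)},
          [{"a", "b"}, {"a", "c"}, {"b", "c"}])
epoch1 = ({"x": (None, 0.0), "y": (None, 0.0), "z": (None, 0.0)},
          [{"x", "y"}, {"x", "z"}, {"y", "z"}])
print(running_epoch([epoch0, epoch1]))
```

Because the scan stops at the first epoch without a wedged quorum, at most one epoch is reported as RUNNING, matching the definition.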

A process $p$ in epoch $e + 1$ ignores grant requests, and does not send any,
until it has learned that epoch $e$ is TERMINATED. Process $p$ can learn that $e$ is
TERMINATED by querying processes in a quorum of $Q_e$ and detecting that these
are wedged, or by receiving a grant request from a process in $P_{e+1}$.

Note that $isLeader_p(t)$ will hold only if epoch $e$ is RUNNING at time $t$ and
$p \in P_e$ holds. Because at most one epoch is RUNNING, Leader Uniqueness continues
to hold, even given multiple epochs.

Leader Stability no longer makes sense because epoch memberships are
non-overlapping. However, an epoch $e + 1$ that wants to start running could have
a particular process $p \in P_{e+1}$ be in charge of wedging the processes in $P_e$, by
sending a $(\mathit{grantRequest}, \bot, Start_p, \infty)$ message to these processes and, upon
obtaining ok responses from a quorum of those processes, sending a regular grant
request to the processes in $P_{e+1}$. Thus the new epoch has significant
control over which process it wants to be leader initially.

If a process receives a grant request with $d = \infty$ but has an outstanding
(normal) grant, it could buffer the special grant request and grant it
upon expiry of the current lease. In that case, if at most a quorum of processes
in a RUNNING epoch are faulty, the epoch will eventually become TERMINATED
and the processes of the next epoch will be able to learn so. Therefore Eventual
Election also continues to hold.

The reconfiguration protocol should be invoked when processes are suspected 
of having crashed, or eventually there may no longer be a quorum available to 
elect a leader. The reconfiguration protocol can only make progress if a quorum 
in Q e is correct and can be wedged, and thus if too many processes crash, it is 
no longer possible to reconfigure. Under manual intervention, an administrator 
could explicitly mark certain processes as having failed. The quorum system 
could then be adjusted with smaller quorums in order to make progress. 

Note that edict timestamps can be extended with epochs to ensure
that Edict Ordering continues to hold.
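
One plausible way to extend timestamps with epochs is to order edicts lexicographically by epoch number first and timestamp second. This ordering is an assumption of the sketch below, motivated by the fact that epochs run one after another; the type and field names are hypothetical.

```python
from typing import NamedTuple


class EdictStamp(NamedTuple):
    epoch: int        # epoch in which the edict was created
    timestamp: float  # timestamp assigned within that epoch


# Tuples compare field by field, so the epoch is the most significant part.
stamps = [EdictStamp(1, 2.0), EdictStamp(0, 9.0), EdictStamp(1, 1.0)]
print(sorted(stamps))
```

An edict from a later epoch then sorts after every edict from an earlier epoch, regardless of the timestamps involved.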

11 Related Work 

Leader election is used in practical systems. For example, the IEEE 1394
"FireWire" serial bus standard includes, for the purpose of coordination among devices,
a leader election protocol that creates a spanning tree of devices with a unique
root acting as leader. Early work on leader election focused on efficiently finding
extrema (the node with the minimum or maximum identifier) in a connected
network topology of unknown size. The problem was apparently first formulated
and solved in 1977 by Gerard LeLann [8]. Many papers on this subject have
appeared since.

In 1982, Hector Garcia-Molina defined the problem of leader election in a 
distributed system that admits failures [5], and presented protocols. That paper 
includes separate definitions for synchronous and asynchronous systems. For a 
synchronous system, Garcia-Molina's definition of leader election requires that 
there be at most one leader at a time and that, in the absence of failures, a leader
is elected within a fixed time limit. For an asynchronous system, the definition
applies only to those nodes that experience synchronous communication; the
other nodes may end up with different leaders.

Consensus protocols [5] can be used to solve leader election in both syn- 
chronous and asynchronous systems. Each participant proposes itself as leader, 
and the consensus protocol subsequently decides on one of the proposals. Divid- 
ing time into time slots, an instantiation of consensus could be used for each time 
slot. Doing so would lead to unnecessarily high overhead, and many consensus 
protocols rely on leader election themselves, creating a circularity. 

Fueled by leader-based consensus protocols, many papers discuss leader election
in partially asynchronous systems. In this formulation, a protocol may
output multiple leaders, but there must exist a time after which the protocol
outputs exactly one leader. In asynchronous environments these protocols are
probabilistic: they produce a single leader if the environment is reasonably
timely, but may produce multiple leaders if it is not.
We call this weak leader election, but it is also referred to as local leader election.

Weak leader election in asynchronous environments is closely related to the 
failure detection problem, whereby a leader is the node with the lowest (or 
highest) identifier that is not suspected of having failed. Fetzer and Cristian [3]
study the problem of weak leader election in partitionable networks, and use
a technique based on leases [5] (to define partition boundaries). Stable (but
weak) leader election was considered in [T]. A performance comparison of three
recent stable leader election algorithms appears in [9]. That paper also considers
dynamic membership.

The problem of strong leader election in a partially synchronous environment
was discussed by Fetzer and Süßkraut in [3]. Their protocol uses leases and
quorums. The Nerio protocols described in this paper generalize this idea by
defining an invariant (Equation (14)) that all such protocols must satisfy, and
can be used to transform any weak leader election protocol into one that is both
strong and stable and that supports dynamic membership.